Potential function of simplified protein models for discriminating native proteins from decoys: 
Combining contact interaction and local sequence-dependent geometry 
Abstract: An effective potential function is critical for protein structure prediction and folding simulation. Simplified protein models, in which only the coordinates of Cα atoms need to be specified, are essential for efficient search of conformational space, and they require an accurate potential function. In this work, we present a formulation of a potential function for simplified representations of protein structures. It is based on a combination of descriptors derived from residue-residue contacts and sequence-dependent local geometry. The optimal weight coefficients for contact and local geometry are obtained through optimization by maximizing the margin between native and decoy structures. The latter are generated by chain growth and by gapless threading. The performance of the potential function in blind tests of discriminating native protein structures from decoys is evaluated using several benchmark decoy sets. This potential function has comparable or better performance than several residue-based potential functions that additionally require coordinates of side chain centers or of all side chain atoms.

Key words: Decoy discrimination, potential function, protein structure prediction, simplified protein models.

1. INTRODUCTION 

Studies of protein folding are often based on the thermody- 
namic hypothesis, which postulates that the native state of a 
protein is the state of lowest free energy under physiological 
conditions. Based on this assumption, a potential function 
where the native protein has the lowest energy is essential 
for protein structure prediction, folding simulation, and pro- 
tein design. 

There are two steps in constructing such a potential function. The first step is to define a proper representation of protein structures, which is usually based on a set of numerical descriptors characterizing the structure and sequence of the protein. The second step is to decide on a functional form, which takes the descriptor vector and maps it to a real-valued energy or score for the particular structure. Frequently, the potential function $H(\cdot)$ takes the form $H(f(s, a)) = H(c) = w \cdot c$, where the structure $s$ of a protein and its amino acid sequence $a$ are mapped to a numerical $d$-vector $c$ by the representation $f$, such that $f : (s, a) \mapsto c$. The potential function $H(c) \in \mathbb{R}$ then maps the vector $c$ to a real-valued energy [1]. Protein representations strongly influence the effectiveness of a potential function. The most successful potential functions rely on detailed all-atom representations of native protein structures with tens or hundreds of thousands of parameters. Most residue-level potential functions, although they have significantly fewer parameters, still need atom-level information. However, simplified representations of proteins are essential for structure prediction and folding simulation: since not all heavy atoms are explicitly represented, sampling of conformational space is far more efficient. A challenging open problem is whether a potential function based on a simplified protein representation can perfectly discriminate native or near-native conformations from decoy conformations. If so, it is important to identify the minimal representation through which such discrimination can be achieved.



In this work, we present a formulation of a potential function designed for simplified n-state representations of protein structures at the residue level. In addition to contact interactions, we encode sequence-dependent local geometric information. These are combined to form a potential function called Contact and Local Geometric Potential (CLGP). Our paper is organized as follows: first we introduce the simplified n-state representation of protein structures. We then describe the scoring function, which is a weighted linear combination of contact and geometric parameters. We also discuss the optimization method used to derive the parameters of the potential function. Finally, we show the performance of CLGP on both the Park-Levitt decoy set [2] and decoys generated by gapless threading.




Fig. 1. a. Discrete state model; $\alpha_i$ and $\tau_i$ are shown for residue $i$. Bond angle $\alpha_i$ at position $i$ is formed by $C_{\alpha,i-1}$, $C_{\alpha,i}$, and $C_{\alpha,i+1}$. Torsion angle $\tau_i$ is the dihedral angle of the two planes formed by atoms $(C_{i-2}, C_{i-1}, C_i)$ and $(C_{i-1}, C_i, C_{i+1})$; b. Distribution of α and τ angles.

2. MODELS AND METHODS 

Representation of protein structures. Our representation of protein structures is an off-lattice discrete state model [3]. Specifically, we represent all backbone atoms by the Cα atom, and all side chain atoms by one additional atom attached to the main chain Cα atom (Figure 1a). In total there are 20 different atom types (1 Cα and 19 side chain atoms). The backbone structure is described by the bond angle α and torsion angle τ at each Cα position (Figure 1a). The overall three-dimensional structure of a protein of length N is completely determined by the set of bond angles and torsion angles $\{(\alpha_i, \tau_i)\}$, $1 < i < N$, where $i$ denotes the i-th position of the backbone. The distance between each Cα atom and its side chain atom is predefined for each residue type.
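To illustrate how a set of (α, τ) angles determines a backbone, the following minimal sketch rebuilds a Cα trace from internal coordinates. It is not the implementation used in this work: the fixed virtual bond length of 3.8 Å, the function names, and the placement of the first three atoms in the xy plane are assumptions made for illustration, and side chain atoms are omitted.

```python
import math

CA_CA = 3.8  # assumed Calpha-Calpha virtual bond length in angstroms

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]

def place_next(a, b, c, bond, angle, torsion):
    """Place atom d with |cd| = bond, angle(b, c, d) = angle and
    dihedral(a, b, c, d) = torsion (angles in radians)."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal of the plane (a, b, c)
    m = cross(n, bc)                 # completes the local right-handed frame
    local = (-bond * math.cos(angle),
             bond * math.sin(angle) * math.cos(torsion),
             bond * math.sin(angle) * math.sin(torsion))
    return [c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
            for i in range(3)]

def build_ca_trace(angles, bond=CA_CA):
    """angles[k] = (alpha, tau) for the k-th interior position.
    The torsion of the first interior position is unused because
    no atom precedes the first three."""
    alpha0 = angles[0][0]
    coords = [[0.0, 0.0, 0.0],
              [bond, 0.0, 0.0],
              [bond - bond * math.cos(alpha0), bond * math.sin(alpha0), 0.0]]
    for alpha, tau in angles[1:]:
        coords.append(place_next(coords[-3], coords[-2], coords[-1],
                                 bond, alpha, tau))
    return coords
```

A chain of N residues has N - 2 interior positions, so a list of N - 2 angle pairs yields N Cα coordinates.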

In the discrete state model, each main chain position, except for residues at the two ends, can take one of n discrete pairs of (α, τ) angles, which are the states at this position. To determine the optimal number of states and the exact values of the (α, τ) angles for the discrete representation, we examine the distribution of α and τ angles calculated from 980 non-homologous proteins from PDBSELECT. Analogous to the Ramachandran plot, the distribution of (α, τ) has densely and sparsely populated regions, which correspond to different secondary structure types (Fig. 1b). The distribution also differs among amino acids. To obtain exact values for the discrete states, we use k-means clustering to group the α and τ values observed in native proteins into k clusters, with k from 3 to 10, for each amino acid residue type. The cluster centers are then taken as the discrete states.
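The clustering step can be sketched with a plain Lloyd-style k-means on (α, τ) points. This is only an illustration under stated assumptions: the deterministic initialization, the Euclidean distance in degrees, and the neglect of the periodicity of τ are simplifications introduced here, not details taken from this work.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm on 2-D points (alpha, tau) given in degrees.
    Assumption: plain Euclidean distance; torsion periodicity is ignored."""
    pts = sorted(points)
    # Deterministic start: pick k points spread evenly over the sorted data.
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append((sum(p[0] for p in g) / len(g),
                                    sum(p[1] for p in g) / len(g)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Running this once per amino acid type with k between 3 and 10 would give one candidate set of discrete states per type and per k.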

The discrete state representation is associated with a discrete conformational space that is different from the continuous space of PDB structures. To represent a PDB structure in the simplified discrete space, we need to map the PDB structure to a structure in the discrete space, with the requirement that these two structures are most similar by some criterion. In this study we consider two such criteria: global structural similarity and local structural similarity. To generate a discrete structure that is globally similar to a PDB structure $s$, we use a heuristic "build-up" algorithm [4], which globally fits the discrete state model to the PDB structure $s$, namely $s_{\min} = \arg\min_{s_d} \mathrm{RMSD}(s, s_d)$, where $s_d$ ranges over discrete state structures. We obtain discrete best-fit structures for 348 proteins, with lengths ranging from 40 to more than 1,000 residues. The average RMSD of the best-fit structures to the PDB structures is 2.4 Å for the 4-state model, 1.8 Å for the 6-state model, and 1.1 Å for the 10-state model. Figure 2a shows the average RMSD for models with 3 to 10 states. The relatively high quality of these fitted discrete state structures indicates that a model with four to six states is sufficient to generate near native structures (NNS) that have small RMSD to experimental structures. In this study, we use the 4-state model for all amino acid residues. To generate a discrete state model with local similarity to a PDB structure, each residue is simply assigned the discrete state most similar to its local (α, τ) angles in the PDB structure.

Fig. 2. a. Average RMSD of the best discrete state models fitted to PDB structures of 348 proteins, for models with 3 to 10 states; b. Hierarchical clustering of amino acids by local sequence geometry propensity.
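The local-similarity mapping described above amounts to a nearest-state assignment per residue. The sketch below is a hedged illustration: the similarity measure (squared difference in α plus squared periodic difference in τ, both in degrees) and the example state values are assumptions, since the exact measure is not specified here.

```python
def angle_diff(a, b):
    """Smallest signed difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def nearest_state(alpha, tau, states):
    """Index of the discrete state closest to (alpha, tau).
    Assumed metric: squared alpha difference + squared periodic tau difference."""
    return min(range(len(states)),
               key=lambda j: (alpha - states[j][0]) ** 2
                             + angle_diff(tau, states[j][1]) ** 2)

def discretize(angles, states):
    """Map each residue's (alpha, tau) to the index of its nearest state."""
    return [nearest_state(a, t, states) for a, t in angles]

# Hypothetical 4-state set, for illustration only.
EXAMPLE_STATES = [(90.0, 50.0), (120.0, -170.0), (100.0, -60.0), (130.0, 100.0)]
```

Note that a torsion of 175 degrees is treated as close to -170 degrees because the difference is taken modulo 360.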

Clustering amino acids by first order state 
transition propensity. We reduce the alphabet of 
twenty amino acids to a simpler alphabet based on their lo- 
cal geometric properties. Simplifying the residue alphabet 
by geometric properties helps to improve characterization 
of the local relationship between sequence and backbone 
structure, and alleviate the problem of inadequate data for 
deriving empirical potential functions. 

A protein structure can be represented uniquely by a sequence of $(a, x)$ pairs, where $a$ is the amino acid residue type and $x$ is the discrete state. For four discrete states and 20 amino acid types, the total number of possible descriptors at one residue position is 4 × 20 = 80. For simplicity, we use $s \in \{1, \ldots, 80\}$ to represent the state (discrete state and amino acid type) a residue may take. We define the first order state transition probability as $p_{s_1,s_2} = p[s_2 \mid s_1] = p[(a_2, x_2) \mid (a_1, x_1)]$. The first order state transition geometric score can then be defined as

$$\ln \frac{p_{s_1,s_2,\mathrm{obs}}}{p_{s_1,s_2,\mathrm{exp}}},$$

where $p_{s_1,s_2,\mathrm{obs}}$ is the observed probability of transition from state $s_1$ to state $s_2$, and $p_{s_1,s_2,\mathrm{exp}}$ is the expected transition probability from state $s_1$ to $s_2$ for a random distribution, which is the product of the relative frequencies of $a_1$, $x_1$, and $x_2$. We cluster the twenty amino acids using their first order state transition scores. Specifically, we define the distance between two amino acids as the distance between the corresponding row vectors in the state transition matrix (Figure 2b).
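A log-odds score of this kind can be estimated from counts of consecutive state pairs. The sketch below is illustrative only. As an explicit assumption, it takes the expected probability to be the product of the marginal frequencies of the first and second states, which is a simpler null model than the one described above; the function name and the input format are likewise hypothetical.

```python
import math
from collections import Counter

def transition_scores(chains):
    """chains: one list of states per protein (any hashable state labels).
    Returns {(s1, s2): ln(p_obs / p_exp)} for every observed transition.
    Assumption: p_exp = freq(s1 as first state) * freq(s2 as second state)."""
    pair, first, second = Counter(), Counter(), Counter()
    for chain in chains:
        for s1, s2 in zip(chain, chain[1:]):
            pair[(s1, s2)] += 1
            first[s1] += 1
            second[s2] += 1
    total = sum(pair.values())
    scores = {}
    for (s1, s2), n in pair.items():
        p_obs = n / total
        p_exp = (first[s1] / total) * (second[s2] / total)
        scores[(s1, s2)] = math.log(p_obs / p_exp)
    return scores
```

Transitions that never occur receive no score here; a practical implementation would likely need pseudocounts, which are not covered by this sketch.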

This clustering approach based on local sequence- 
structure relationship provides a useful method to group 
amino acids and to obtain a simplified protein alphabet. 
For this study, we take a simplified alphabet of 5 letters: {C,I,V,L,M,W,F,Y}, {A,E,K,Q,R}, {G,S,H,T}, {P}, and {D,N}. Each residue position now has 4 × 5 = 20 possible descriptors: there are 4 discrete (α, τ) states and 5 types of amino acids.

The representation of a protein structure is now a vector that includes both contact interactions (with 210 different types of contacts) and local geometric scores (with 20 × 20 = 400 parameters, one for each pair of the 4 × 5 = 20 geometric descriptors at two consecutive residues).
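The bookkeeping behind these descriptor counts can be sketched as follows. The index layout and function names are hypothetical. We also assume that the 210 contact types are the unordered pairs among the 20 atom types, since 20 × 21 / 2 = 210; that reading is an assumption rather than something stated above.

```python
GROUPS = ["CIVLMWFY", "AEKQR", "GSHT", "P", "DN"]   # the 5-letter alphabet
GROUP_OF = {aa: g for g, letters in enumerate(GROUPS) for aa in letters}
N_STATES = 4
N_GEOM = len(GROUPS) * N_STATES                     # 4 x 5 = 20 descriptors

def geom_descriptor(aa, state):
    """Index in 0..19 for a residue of type aa in discrete state 0..3."""
    return GROUP_OF[aa] * N_STATES + state

def transition_index(d1, d2):
    """Index in 0..399 for descriptors d1, d2 at two consecutive residues."""
    return d1 * N_GEOM + d2

def contact_index(t1, t2, n_types=20):
    """Index in 0..209 for an unordered pair of atom types (assumed layout)."""
    i, j = sorted((t1, t2))
    return i * n_types - i * (i - 1) // 2 + (j - i)
```

Under these assumptions, the contact block and the local geometry block together would contain 210 + 400 = 610 entries.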

Combining contact and local geometric descriptors. The scoring function in this study is a weighted linear combination of all descriptors: $H(c) = w \cdot c$, where $c$ is the descriptor vector and $w$ is the weight vector. We obtain the weight vector $w$ by optimization. For a linear scoring function, the basic requirement based on Anfinsen's experiment is

$$w \cdot (c_N - c_D) + b < 0, \quad \text{where } b > 0,$$

where $c_N$ and $c_D$ are the descriptor vectors for the native and decoy structures of a specific protein, respectively, and $b$ is the energy gap required between a native and a decoy structure. Each pair of native and decoy descriptor vectors provides one inequality constraint. Together, many such constraints define a convex polyhedron V in which the feasible weight vectors $w$ are located. For nonempty V, there can be infinitely many choices of $w$, all with perfect discrimination. The optimal weight vector $w$ can be found by solving the following quadratic programming problem:

$$\text{Minimize } \tfrac{1}{2}\|w\|^2$$
$$\text{subject to } w \cdot (c_N - c_D) + b < 0 \text{ for all } N \text{ and } D. \qquad (2)$$

The solution maximizes the distance of the plane $(w, b)$ to the origin. Here, we use a support vector machine to find such an optimal $w$.
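To make the margin idea concrete, the following sketch trains a weight vector on difference vectors $c_N - c_D$. It deliberately swaps in a different technique: full-batch subgradient descent on a regularized hinge loss, not the quadratic programming or support vector machine solver used in this work. The unit margin, the regularization strength, the step size, and all names are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def energy_gap(w, diff):
    """w . (c_N - c_D); negative means the native scores lower than the decoy."""
    return dot(w, diff)

def train_max_margin(diffs, lam=0.01, lr=0.1, epochs=200):
    """Approximately minimize lam/2 * |w|^2 + mean(max(0, 1 + w . d)).
    diffs holds the vectors d = c_N - c_D; the margin is fixed to 1 (assumed)."""
    dim = len(diffs[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for d in diffs:
            if 1.0 + dot(w, d) > 0.0:            # margin violated by this pair
                for i in range(dim):
                    grad[i] += d[i] / len(diffs)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

For the millions of native-decoy pairs used in practice, a dedicated SVM or quadratic programming solver would be the appropriate tool; this loop only shows what the constraints ask of $w$.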

We take 980 non-homologous proteins from the PDBSELECT database. Among these, 652 proteins are randomly chosen as the training set, and the remaining 328 proteins are used as the testing set. We first use gapless threading to generate more than 10 million decoys to train and test our contact and local geometric potential function (CLGP). This is then complemented by training using explicit decoys obtained from [5].

Table I. The number of misclassifications using CLGP and other residue-based potential functions. Cα: computation of the potential function needs only Cα atoms. SC: computation of the potential function needs side chain centers. AT: computation of the potential function needs an all-atom representation.

Functions    Complexity    Misclassified proteins
CLGP         Cα            4/328
TE13         SC            7/194
BV           AT            2/194
MJ           SC            85/194
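Gapless threading, used above to generate decoys, can be sketched as mounting a sequence on every contiguous window of a larger template structure. The function name and the data layout (a list of per-residue coordinates or states) are assumptions for illustration; details such as excluding the native's own structure are not shown.

```python
def gapless_threading(sequence, template):
    """Yield (offset, fragment) decoys for one sequence on one template.
    template: per-residue coordinates or discrete states of a longer protein.
    Each fragment has the same length as the sequence, with no gaps."""
    n = len(sequence)
    for offset in range(len(template) - n + 1):
        yield offset, template[offset:offset + n]
```

A sequence of length n threaded onto a template of length m >= n yields m - n + 1 decoys, which is how a few hundred proteins can produce millions of decoys.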

3. RESULTS 

Performance on gapless threading decoys. The performance of the potential function on gapless threading decoys is listed in Table I. Among the 328 test proteins, only 4 are misclassified, an accuracy of about 98.8% (324/328). We compare CLGP with several other residue-based potential functions, including those developed by Tobi & Elber et al. (TE13) [6], Miyazawa & Jernigan (MJ) [7], and Bastolla & Vendruscolo et al. (BV) [8]. Although these are residue-based potential functions, they still require an all-atom representation, since they need either to calculate the side chain centers or to compute explicit atom-atom contacts. Among these, CLGP is the only potential function applicable to representations with only Cα atoms. It performs better than the other residue-based potential functions, and its performance is comparable to that of the all-atom potential.
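The misclassification count can be made precise with a small check. As an assumption for illustration, a protein is counted as misclassified when at least one of its decoys has an energy no higher than that of the native structure; the function names are hypothetical.

```python
def is_misclassified(native_energy, decoy_energies):
    """True if any decoy scores as low as or lower than the native (assumed rule)."""
    return any(e <= native_energy for e in decoy_energies)

def summarize(results):
    """results: list of (native_energy, decoy_energies), one entry per protein.
    Returns (number misclassified, fraction correctly classified)."""
    wrong = sum(is_misclassified(n, d) for n, d in results)
    return wrong, 1.0 - wrong / len(results)
```

With 4 misclassified proteins out of 328, the fraction correctly classified is 1 - 4/328, or about 0.988.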

Performance on other decoy sets. Potential functions trained using gapless threading decoys usually do not work well for more sophisticated decoys generated by other methods. To further train the CLGP potential function, we include an additional decoy set generated by Loose et al. [5], which contains more than 80,000 decoy structures generated by energy minimization protocols for a set of 829 proteins. The new CLGP potential function is tested using the Park-Levitt decoy set created in [2]. The performance of our potential function on several decoy sets is shown in Table II. We also compare our results with three other residue-based potential functions, namely TE13, LHL (Li, Hu, & Liang [9]), and MJ. The performance of CLGP is in general better than or comparable to that of the other potential functions. It performs better than the other potential functions on the LMDS decoy set.



Table II. Performance of CLGP for the Park-Levitt decoy set.

4. SUMMARY AND CONCLUSION

In this study, we have developed an effective potential function for simplified representations of protein structures. By k-means clustering, we obtained accurate discrete states and demonstrated the accuracy of models generated from these states using a modified build-up algorithm. We then simplified the amino acid alphabet using local sequence-structure relationships; the first order state transition propensity is constructed for this purpose. This potential function combines both contact interaction and local sequence-structure relationship. Contact interaction correlates well with more global topological information, while local sequence-structure relationship contains local geometric information. We find that native structures can be well stabilized against decoy structures. The performance of this potential function is better than or comparable to that of other potential functions, yet it requires a significantly less detailed description. We expect this potential function to be useful in protein structure prediction and folding simulation of simplified protein models.